Self-distillation exploits non-uniform soft supervision from the network itself during training and improves performance without any runtime cost. However, the overhead it adds during training is often overlooked, even though time and memory costs during training matter increasingly in the era of giant models. This paper proposes an efficient self-distillation method named Zipf's Label Smoothing (Zipf's LS), which uses the network's on-the-fly predictions to generate soft supervision that conforms to a Zipf distribution, without using any contrastive samples or auxiliary parameters. The idea comes from the empirical observation that when a network is properly trained, its output probabilities, after being sorted by magnitude within each sample and averaged across samples, should follow a distribution reminiscent of Zipf's law in the word-frequency statistics of natural language. By enforcing this property at the sample level and throughout the whole training period, we find that prediction accuracy can be greatly improved. Using ResNet50 on the iNat21 fine-grained classification dataset, our technique achieves a +3.61% accuracy gain over the vanilla baseline, and 0.88% more gain than previous label smoothing or self-distillation strategies. The implementation is publicly available at https://github.com/megvii-research/zipfls.
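For concreteness, here is a minimal PyTorch-style sketch of how Zipf-distributed soft labels could be built from a network's own ranked predictions. The `target_mass` and `alpha` knobs and the KL-based loss form are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn.functional as F

def zipf_soft_labels(logits: torch.Tensor, targets: torch.Tensor,
                     target_mass: float = 0.9, alpha: float = 1.0) -> torch.Tensor:
    """Build per-sample soft labels whose non-target part follows a Zipf
    (1 / rank^alpha) distribution, with classes ranked by the network's
    own predictions. `target_mass` and `alpha` are illustrative knobs."""
    probs = logits.softmax(dim=-1)
    order = probs.argsort(dim=-1, descending=True)        # classes sorted by confidence
    ranks = order.argsort(dim=-1) + 1                      # rank 1 = most confident class
    zipf = 1.0 / ranks.float().pow(alpha)
    # Reserve the ground-truth class; spread Zipf mass over non-target classes only.
    zipf.scatter_(1, targets.view(-1, 1), 0.0)
    zipf = zipf / zipf.sum(dim=-1, keepdim=True) * (1.0 - target_mass)
    return zipf.scatter(1, targets.view(-1, 1), target_mass)

def zipf_ls_loss(logits: torch.Tensor, targets: torch.Tensor, **kw) -> torch.Tensor:
    """KL divergence between the model's prediction and its own Zipf soft label."""
    soft = zipf_soft_labels(logits.detach(), targets, **kw)
    return F.kl_div(logits.log_softmax(dim=-1), soft, reduction="batchmean")
```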
This paper aims to explain how deepfake detection models learn artifact features of images when supervised only by binary labels. To this end, three hypotheses are proposed from the perspective of image matching, as follows. 1. Deepfake detection models distinguish real/fake images based on visual concepts that are neither source-relevant nor target-relevant; that is, such visual concepts are regarded as artifact-relevant. 2. Besides the supervision of binary labels, deepfake detection models implicitly learn artifact-relevant visual concepts through FST matching (i.e., matched fake, source, and target images) in the training set. 3. The artifact-relevant visual concepts implicitly learned through FST matching on the raw training set are vulnerable to video compression. In experiments, the above hypotheses are verified across various DNNs. Furthermore, based on this understanding, we propose the FST-Matching deepfake detection model to improve forgery detection performance on compressed videos. Experimental results show that our method achieves excellent performance, especially on highly compressed (e.g., c40) videos.
Obtaining higher sample efficiency and superior final performance simultaneously has been one of the major challenges of deep reinforcement learning (DRL). Previous work could handle one of these challenges but typically failed to address them concurrently. In this paper, we try to tackle both challenges at the same time. To achieve this, we first decouple them into two classic RL problems: data richness and the exploration-exploitation trade-off. We then cast both problems as a training-data distribution optimization problem, namely obtaining the desired training data within limited interactions, and address them concurrently via i) explicit modeling and control of the capacity and diversity of the behavior policy, and ii) more fine-grained and adaptive control of the selective/sampling distribution over behavior policies using monotonic data distribution optimization. Finally, we integrate this process into Generalized Policy Iteration (GPI) and obtain a more general framework called Generalized Data Distribution Iteration (GDI). We use the GDI framework to introduce operator-based versions of well-known RL methods, from DQN to Agent57. A theoretical guarantee of GDI's advantage over GPI is also given. We further demonstrate state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), where our algorithm reaches a 9620.33% mean human normalized score (HNS) and a 1146.39% median HNS, and surpasses 22 human world records, using only 200M training frames. Our performance is comparable to Agent57's while consuming 500 times less data. We argue that there is still a long way to go before truly superhuman agents are obtained in ALE.
The Arcade Learning Environment (ALE) was proposed as a platform for empirically assessing the generality of agents across Atari 2600 games. ALE offers a variety of challenging problems and has drawn significant attention from the deep reinforcement learning (RL) community. From Deep Q-Networks (DQN) to Agent57, RL agents appear to achieve superhuman performance in ALE. But is this really the case? In this paper, to explore this question, we first review the current evaluation metrics in the Atari benchmark and then reveal that the current criteria for claiming superhuman performance are inappropriate, as they underestimate human performance relative to what is actually possible. To address these issues and promote the development of RL research, we propose a novel Atari benchmark based on human world records (HWR), which places higher demands on RL agents in terms of both final performance and learning efficiency. In addition, we summarize the state-of-the-art (SOTA) methods on the Atari benchmark and provide benchmark results under the new evaluation metrics based on human world records. From these new benchmark results, we conclude that at least four open challenges prevent RL agents from achieving superhuman performance. Finally, we also discuss some promising directions for addressing these problems.
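As a point of reference for the metric discussion, the sketch below contrasts the conventional human normalized score with a normalization against the human world record, in the spirit of the benchmark proposed here; the function names and the exact form of the record-based variant are assumptions for illustration, not the paper's definition.

```python
def human_normalized_score(agent_score: float, random_score: float,
                           human_score: float) -> float:
    """Conventional HNS: 0.0 corresponds to random play, 1.0 (100%) to the
    human reference score used by prior Atari evaluations."""
    return (agent_score - random_score) / (human_score - random_score)

def record_normalized_score(agent_score: float, random_score: float,
                            world_record: float) -> float:
    """Hypothetical HWR-based variant: normalizes against the human world
    record, so a value >= 1.0 means the record was matched or surpassed."""
    return (agent_score - random_score) / (world_record - random_score)
```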
Deep Q-Networks (DQN) kicked off deep reinforcement learning by combining deep learning (DL) with reinforcement learning (RL), and observed that the distribution of the acquired data changes during training. DQN found that this property could cause training instability and therefore proposed effective methods to handle its downsides. Instead of focusing on the unfavorable aspects, we find it crucial for RL to narrow the gap between the estimated data distribution and the ground-truth data distribution, something supervised learning (SL) fails to do. From this new perspective, we extend the basic paradigm of RL, Generalized Policy Iteration (GPI), into a more general version called Generalized Data Distribution Iteration (GDI). We show that a large body of RL algorithms and techniques can be unified under the GDI paradigm, of which GPI can be considered a special case. We provide theoretical proof of why GDI is better than GPI and how it works. Several practical algorithms based on GDI are proposed to verify its effectiveness and generality. Empirical experiments demonstrate state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), where our algorithm achieves a 9620.98% mean human normalized score (HNS), a 1146.39% median HNS, and 22 human world record breakthroughs (HWRB) using only 200M training frames. Our work aims to lead RL research into the journey of conquering human world records and seeking truly superhuman agents in both performance and efficiency.
We study the problem of model-free reinforcement learning, which is often solved following the principle of Generalized Policy Iteration (GPI). Although GPI is typically an interplay between policy evaluation and policy improvement, most conventional model-free methods assume independence between the granularity and other details of the two GPI steps, despite the inherent connections between them. In this paper, we present a method that regularizes the inconsistency between policy evaluation and policy improvement, leading to GPI solutions that avoid this conflict and reduce function approximation error. To this end, we formulate a novel learning paradigm in which taking a policy evaluation step is equivalent to a compensation for performing policy improvement, thereby effectively alleviating the gradient conflict between the two GPI steps. We also show that the form of our proposed solution is equivalent to performing entropy-regularized policy improvement, which prevents the policy from getting trapped in suboptimal solutions. We conduct extensive experiments to evaluate our method on the Arcade Learning Environment (ALE). Empirical results show that our method outperforms several strong baselines on the major evaluation domains.
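To make the last point concrete, the closed-form solution of a generic entropy-regularized policy improvement step is a softmax over action values; the snippet below is only a generic illustration of that connection, with an arbitrarily chosen temperature `tau`, and is not the paper's specific algorithm.

```python
import torch

def entropy_regularized_improvement(q_values: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """The policy maximizing E_pi[Q(s, .)] + tau * H(pi) is
    pi(a|s) proportional to exp(Q(s, a) / tau), i.e. a softmax over Q."""
    return torch.softmax(q_values / tau, dim=-1)
```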
In recent years, graph representation learning has achieved remarkable success while suffering from low-quality data problems. As a mature technology for improving data quality in computer vision, data augmentation has also attracted increasing attention in the graph domain. To promote the development of this emerging research direction, in this survey we comprehensively review and summarize the existing graph data augmentation (GDAug) techniques. Specifically, we first summarize a variety of feasible taxonomies, and then classify existing GDAug studies based on fine-grained graph elements. Furthermore, for each type of GDAug technique, we formalize the general definition, discuss the technical details, and provide schematic illustrations. In addition, we summarize common performance metrics and specific design metrics for constructing a GDAug evaluation system. Finally, we summarize the applications of GDAug at both the data and model levels, as well as future directions.
The understanding capabilities of current state-of-the-art 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories. In its 2D counterpart, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for 3D modality could be promising to improve 3D understanding under the restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of image, text, and 3D point cloud by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training with massive image-text pairs. Then, ULIP learns a 3D representation space aligned with the common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to 3D backbone networks and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones by simply pre-training them on ShapeNet55 using our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% on top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models will be released.
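As a rough illustration of the kind of alignment objective such pre-training could use, the sketch below aligns 3D point-cloud features with frozen image and text embeddings of the same objects via a symmetric InfoNCE-style loss; the function name, temperature, and equal weighting are assumptions, not ULIP's released objective.

```python
import torch
import torch.nn.functional as F

def triplet_alignment_loss(point_feat: torch.Tensor,
                           image_feat: torch.Tensor,
                           text_feat: torch.Tensor,
                           tau: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss pulling each object's 3D feature toward its
    (frozen) image and text embeddings and away from other objects in the batch."""
    p = F.normalize(point_feat, dim=-1)
    labels = torch.arange(p.size(0), device=p.device)   # matching pairs sit on the diagonal
    loss = 0.0
    for anchor in (F.normalize(image_feat, dim=-1), F.normalize(text_feat, dim=-1)):
        logits = p @ anchor.t() / tau                    # (batch, batch) similarity matrix
        loss = loss + 0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.t(), labels))
    return loss / 2
```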
What is a rose, visually? A rose comprises its intrinsics, including the distribution of geometry, texture, and material specific to its object category. With knowledge of these intrinsic properties, we may render roses of different sizes and shapes, in different poses, and under different lighting conditions. In this work, we build a generative model that learns to capture such object intrinsics from a single image, such as a photo of a bouquet. Such an image includes multiple instances of an object type. These instances all share the same intrinsics, but appear different due to a combination of variance within these intrinsics and differences in extrinsic factors, such as pose and illumination. Experiments show that our model successfully learns object intrinsics (distribution of geometry, texture, and material) for a wide range of objects, each from a single Internet image. Our method achieves superior results on multiple downstream tasks, including intrinsic image decomposition, shape and image generation, view synthesis, and relighting.
We present a new method for generating controllable, dynamically responsive, and photorealistic human animations. Given an image of a person, our system allows the user to generate Physically plausible Upper Body Animation (PUBA) using interaction in the image space, such as dragging their hand to various locations. We formulate a reinforcement learning problem to train a dynamic model that predicts the person's next 2D state (i.e., keypoints on the image) conditioned on a 3D action (i.e., joint torque), and a policy that outputs optimal actions to control the person to achieve desired goals. The dynamic model leverages the expressiveness of 3D simulation and the visual realism of 2D videos. PUBA generates 2D keypoint sequences that achieve task goals while being responsive to forceful perturbation. The sequences of keypoints are then translated by a pose-to-image generator to produce the final photorealistic video.